arXiv:1505.00908v1 [cs.LG] 5 May 2015


Reinforced Decision Trees 




Aurelia Leon                                   aurelia.leon@lip6.fr
Sorbonne Universites, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris, France

Ludovic Denoyer                                ludovic.denoyer@lip6.fr
Sorbonne Universites, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris, France

Abstract 


In order to speed up classification models when facing a large number of categories, one
usual approach consists in organizing the categories in a particular structure, which is
then used to speed up the prediction computation. This is for example
the case when using error-correcting codes or hierarchies of categories. But in the
majority of approaches, this structure is chosen by hand, or during a preliminary step,
and is not integrated in the learning process. We propose a new model called Reinforced
Decision Tree which simultaneously learns how to organize categories in a tree structure
and how to classify any input based on this structure. This approach keeps the advantages
of existing techniques (low inference complexity) but allows one to build efficient classifiers
in one learning step. The learning algorithm is inspired by reinforcement learning and
policy-gradient techniques, which allows us to integrate the two steps (building the tree
and learning the classifier) in one single algorithm.

Keywords: reinforcement learning, machine learning, policy gradient, decision trees, classification


1. Introduction 


The inference complexity of classification models is usually closely tied to, and typically
linear in, the number of possible categories, denoted C. When facing problems with a very large
number of classes, like text classification in large ontologies, object recognition or word
prediction in deep learning language models, this becomes a critical point, making classification
methods inefficient in terms of inference complexity. There is thus a need to develop new
methods able to predict in large output spaces at a low cost.


Several methods have recently been developed for reducing classification time. They
are based on the idea of using a structure that organizes the possible outputs and allows one
to reduce the inference complexity. Two main families of approaches have been proposed:
(i) error-correcting code approaches (Dietterich and Bakiri; Schapire; Cisse et al.), which
associate a short code with each category; classification becomes a code prediction problem,
which can be solved faster by predicting each element of the code; (ii) hierarchical methods
(Bengio et al.; Liu et al., 2013; Weston et al.), where the possible categories are leaves of
a tree. In that case, an output is predicted by choosing a path in the tree, as in decision
trees. These two families typically involve a prediction complexity of $O(\log C)$, allowing
a great speed-up.

But these approaches suffer from one major drawback: the structure used for prediction
(error-correcting codes or tree) is usually built during a preliminary step, before learning
the classifier, following hand-made heuristics, typically by using clustering algorithms or
Huffman codes (Le and Mikolov, 2014). The problem is that the quality of the obtained
structure greatly influences the quality of the final classifier, which makes this step
critical.

In this paper, we propose the Reinforced Decision Tree (RDT) model, which is able to
simultaneously learn how to organize categories in a hierarchy and learn the corresponding
classifier in a single step. The obtained system is a fast and efficient predictive model
where the category structure has been fitted to the particular task to solve, whereas it is built
by hand in existing approaches. The main idea of RDT is to consider the classification
problem as a sequential decision process where a learned policy guides any input x in
a tree structure from the root to one leaf. The learning algorithm is inspired by
policy-gradient methods (Baxter and Bartlett, 1999), which allows us to act both on the way an
input x falls into the tree and on the categories associated with the leaves of this
tree. The difference with the classical Reinforcement Learning context is that the feedback
provided to the system is a differentiable loss function, as proposed in (Denoyer and Gallinari,
2014), which gives more information than a reward signal and allows fast learning of the
parameters.

The contributions are:

• A novel model able to simultaneously discover a relevant hierarchy of categories and
the associated classification model in one step.

• A gradient-based learning method based on injecting a differentiable loss function into
policy gradient algorithms.

• A first set of preliminary experiments over toy datasets, which help to better understand
the properties of such an approach.


2. Notations and Model 

We consider the multi-class classification problem where each input $x \in \mathbb{R}^n$ has to be
associated with one of the C possible categories¹. Let us denote y the label of x, $y \in \mathbb{R}^C$,
such that $y_i = 1$ if x belongs to class i and $y_i = -1$ elsewhere. We will denote
$\{(x^1, y^1), \ldots, (x^N, y^N)\}$ the set of N training examples.
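
The ±1 label encoding described above can be sketched in a few lines of Python. The function name `encode_label` and the zero-based class index are illustrative assumptions, not part of the paper.

```python
def encode_label(class_index, num_classes):
    """Return the +/-1 label vector y for a zero-based class index."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class_index out of range")
    return [1.0 if i == class_index else -1.0 for i in range(num_classes)]


y = encode_label(2, 4)
print(y)  # [-1.0, -1.0, 1.0, -1.0]
```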

2.1 Reinforced Decision Trees Architecture 

The Reinforced Decision Tree (RDT) architecture shares common points with decision trees.
Let us denote $T_{\theta,\alpha}$ such a tree with parameters $\theta$ and $\alpha$ that will be defined later:

• The tree is composed of a set of nodes denoted $\mathrm{nodes}(T_{\theta,\alpha}) = \{n_1, \ldots, n_T\}$ where T is
the number of nodes of the tree.

• $n_1$ is the root of the tree.

• $\mathrm{parent}(n_i) = n_j$ means that node $n_j$ is the parent of $n_i$. Each node but $n_1$ has exactly
one parent; $n_1$ has no parent since it is the root of the tree.

• $\mathrm{leaf}(n_i) = true$ if and only if $n_i$ is a leaf of the tree.

1. The model naturally handles multi-label classification problems, which will not be detailed here for the
sake of simplicity.

Note that we do not have any constraint concerning the number of leaves or the topology 
of the tree. Each node of the tree is associated with its own set of parameters: 

• A node $n_i$ is associated with a set of parameters denoted $\theta_i$ if $n_i$ is an internal node,
i.e. $\mathrm{leaf}(n_i) = false$.

• A node $n_i$ is associated with a set of parameters denoted $\alpha_i \in \mathbb{R}^C$ if $n_i$ is a leaf of the
tree.

Parameters $\theta = \{\theta_i\}$ are the parameters of the policy that will guide an input x from the
root node to one of the leaves of the tree. Parameters $\alpha_i$ correspond to the prediction that
will be produced when an input reaches the leaf $n_i$. Our architecture is thus very close
to classical decision trees, the major difference being that the prediction associated with
each leaf is a set of parameters that will be learned during the training process, allowing
the model to choose how to match the leaves of the tree with the categories. The model
will thus be able to simultaneously learn which path an input has to follow, based on the $\theta_i$
parameters, and how to organize the categories in the tree, based on the $\alpha_i$ parameters.
An example of RDT is given in Figure 1.

Figure 1: An example of RDT. The $\pi_\theta$ functions and the $\alpha$ values have been learned in one
integrated step following a policy-gradient based method. Bold values correspond
to predicted categories at the leaf level. Note that different leaves can predict
the same category. The inference speedup comes from the fact that, in this case,
the score of the 4 categories is computed by taking at most 2 decisions and then
returning the $\alpha$ of the reached leaf (for example $\alpha_3$) as a prediction. $\theta$ and $\alpha$ are learned.
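
The architecture described above can be sketched as a small Python data structure. This is a minimal illustration under assumptions: the names `Node`, `build_tree` and `leaves`, the complete-tree topology, the one-weight-row-per-child layout of $\theta_i$ and the small uniform initialisation are all hypothetical choices, since the paper places no constraint on the topology.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False, repr=False)
class Node:
    """One RDT node: internal nodes carry theta, leaves carry alpha."""
    index: int
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    theta: Optional[List[List[float]]] = None  # assumed: one weight row per child
    alpha: Optional[List[float]] = None        # one score per category

    @property
    def is_leaf(self) -> bool:
        return not self.children


def build_tree(width, depth, input_dim, num_classes, rng):
    """Build a complete tree with `width` children per node and `depth` decision levels."""
    counter = [0]

    def make(parent, level):
        counter[0] += 1
        node = Node(index=counter[0], parent=parent)
        if level == depth:
            # Leaf: alpha is a learnable score vector (random start, as in Algorithm 1).
            node.alpha = [rng.uniform(-0.1, 0.1) for _ in range(num_classes)]
        else:
            # Internal node: theta parameterises the policy over its children.
            node.theta = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
                          for _ in range(width)]
            node.children = [make(node, level + 1) for _ in range(width)]
        return node

    return make(None, 0)


def leaves(node):
    """Collect the leaves below `node` in left-to-right order."""
    if node.is_leaf:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


root = build_tree(width=2, depth=3, input_dim=2, num_classes=4, rng=random.Random(0))
print(len(leaves(root)))  # 8
```

With a width of 2 and a depth of 3 the tree has 8 leaves for 4 categories, so several leaves may end up predicting the same category.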


2.2 Inference Process 

Let us denote H a trajectory, $H = (n_{(1)}, \ldots, n_{(t)})$, where $n_{(i)} \in \mathrm{nodes}(T_{\theta,\alpha})$ and (i) is the
index of the i-th node of the trajectory. H is thus a sequence of nodes where $n_{(1)} = n_1$,
$\forall i > 1$, $n_{(i-1)} = \mathrm{parent}(n_{(i)})$, and $\mathrm{leaf}(n_{(t)}) = true$.

Algorithm 1 Stochastic Gradient RDT Learning Procedure
procedure LEARNING($(x^1, y^1), \ldots, (x^N, y^N)$)    ▷ the training set
    $\epsilon$ is the learning rate
    $\alpha \approx$ random
    $\theta \approx$ random
    repeat
        $i \approx$ uniform(1, N)
        Sample $H = (n_{(1)}, \ldots, n_{(t)})$ using the $\pi_\theta(x^i)$ functions
        $\alpha_{(t)} \leftarrow \alpha_{(t)} - \epsilon \nabla_\alpha \Delta(\alpha_{(t)}, y^i)$
        for $k \in [1..t-1]$ do
            $\theta_{(k)} \leftarrow \theta_{(k)} - \epsilon \left(\nabla_\theta \log \pi_{\theta_{(k)}}(x^i)\right) \Delta(\alpha_{(t)}, y^i)$
        end for
    until Convergence
    return $T_{\theta,\alpha}$
end procedure


Each internal node $n_i$ is associated with a function (or policy) $\pi_{\theta_i}$ whose role is to
compute the probability that a given input x will fall in one of the children of $n_i$. $\pi_{\theta_i}$ is
defined as:

$$\pi_{\theta_i} : \mathbb{R}^n \times \mathrm{nodes}(T_{\theta,\alpha}) \to [0, 1], \qquad \pi_{\theta_i}(x, n) = P(n \mid n_i, x) \tag{1}$$

with $P(n \mid n_i, x) = 0$ if $n \notin \mathrm{children}(n_i)$. In other words, $\pi_{\theta_i}(x, n)$ is the probability that x
moves from $n_i$ to n. Note that the difference between RDT and DT is that the decision taken
at each node is stochastic and not deterministic, which will allow us to use gradient-based
learning algorithms. The probability of a trajectory $H = (n_{(1)}, \ldots, n_{(t)})$ given an input x
can be written as:

$$P(H \mid x) = \prod_{i=1}^{t-1} \pi_{\theta_{(i)}}(x, n_{(i+1)}) \tag{2}$$

Once a trajectory has been sampled, the prediction produced by the model depends on
the leaf $n_{(t)}$ reached by x. The model directly outputs $\alpha_{(t)}$ as a prediction, $\alpha_{(t)}$ being a
vector in $\mathbb{R}^C$ (one score per class) as explained before. Note that the model produces one
score for each possible category, but the inference complexity of this step is $O(1)$ since it
just corresponds to returning the value $\alpha_{(t)}$.

The complete inference process is the following:

1. Sample a trajectory $H = (n_{(1)}, \ldots, n_{(t)})$ given x, by sequentially using the policies
$\pi_{\theta_{(1)}}, \ldots, \pi_{\theta_{(t-1)}}$.

2. Return the predicted output $\alpha_{(t)}$.
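
The two inference steps can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the policy is taken to be a softmax over linear scores (the paper only states that simple linear functions were used), the tree is stored as nested dictionaries, and the names `softmax`, `policy`, `infer` and `make_leaf` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, x):
    """pi_theta(x, .): probabilities over children (assumed linear scores + softmax)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in theta])


def infer(root, x, rng):
    """Sample a root-to-leaf trajectory for x; return (alpha, log P(H|x), path)."""
    node, log_prob, path = root, 0.0, []
    while "children" in node:
        probs = policy(node["theta"], x)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        log_prob += math.log(probs[k])  # log of the product in Eq. (2)
        path.append(k)
        node = node["children"][k]
    return node["alpha"], log_prob, path


def make_leaf(c, num_classes=4):
    """Leaf whose alpha scores category c with +1 and the others with -1."""
    return {"alpha": [1.0 if i == c else -1.0 for i in range(num_classes)]}


# Hypothetical depth-2 binary tree over 2-D inputs with 4 leaves.
tree = {
    "theta": [[5.0, 0.0], [-5.0, 0.0]],
    "children": [
        {"theta": [[0.0, 5.0], [0.0, -5.0]], "children": [make_leaf(0), make_leaf(1)]},
        {"theta": [[0.0, 5.0], [0.0, -5.0]], "children": [make_leaf(2), make_leaf(3)]},
    ],
}
alpha, log_prob, path = infer(tree, [1.0, -1.0], random.Random(0))
predicted = max(range(len(alpha)), key=alpha.__getitem__)
```

Only the policies on the sampled path are evaluated, so the cost of a prediction grows with the depth of the tree and not with the number of leaves.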

2.3 Learning Process

The goal of the learning procedure is to simultaneously learn both the policy functions $\pi_{\theta_i}$
and the output parameters $\alpha_i$ in order to minimize a given learning loss denoted $\Delta$, which
corresponds here to a classification loss (e.g. square loss or hinge loss)².

Our learning algorithm is based on an extension of policy gradient techniques inspired
by the Reinforcement Learning literature and similar to Denoyer and Gallinari
(2014). More precisely, our learning method is close to the methods proposed in Baxter
and Bartlett (1999), with the difference that, instead of considering a reward signal, which is
usual in reinforcement learning, we consider a loss function $\Delta$. This function measures the
quality of the system, providing richer feedback than simple rewards since it
can be differentiated, and thus gives the direction in which parameters have to be updated.
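
To make concrete what a differentiable loss provides, the sketch below returns both the loss value and its gradient with respect to the leaf prediction $\alpha$. The paper names the square and hinge losses without giving formulas, so the per-category forms used here are common definitions taken as an assumption; the function names are hypothetical.

```python
def square_loss(alpha, y):
    """Return (loss, gradient w.r.t. alpha) for sum_c (alpha_c - y_c)^2."""
    loss = sum((a - t) ** 2 for a, t in zip(alpha, y))
    grad = [2.0 * (a - t) for a, t in zip(alpha, y)]
    return loss, grad


def hinge_loss(alpha, y):
    """Return (loss, subgradient w.r.t. alpha) for sum_c max(0, 1 - y_c * alpha_c)."""
    loss = sum(max(0.0, 1.0 - t * a) for a, t in zip(alpha, y))
    grad = [-t if t * a < 1.0 else 0.0 for a, t in zip(alpha, y)]
    return loss, grad


alpha = [0.5, -2.0, 0.0]
y = [1.0, -1.0, -1.0]
print(square_loss(alpha, y))  # (2.25, [-1.0, -2.0, 2.0])
print(hinge_loss(alpha, y))   # (1.5, [-1.0, 0.0, 1.0])
```

A scalar reward would only say how good the prediction was; the gradient additionally says in which direction each component of $\alpha$ should move.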


The performance of our system is denoted $J(\theta, \alpha)$:

$$J(\theta, \alpha) = E_{P_\theta(x, H, y)}\left[\Delta(F_\alpha(x, H), y)\right] \tag{3}$$

where $F_\alpha(x, H)$ is the prediction made following trajectory H, i.e. the sequence of nodes
chosen by the $\pi$-functions. J can be optimized by gradient-descent techniques,
which requires computing the gradient of J:

$$\nabla_{\theta,\alpha} J(\theta, \alpha) = \int \nabla_{\theta,\alpha}\left(P_\theta(H \mid x)\,\Delta(F_\alpha(x, H), y)\right) P(x, y)\, dH\, dx\, dy \tag{4}$$

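This gradient can be rewritten as:

$$\begin{aligned}
\nabla_{\theta,\alpha} J(\theta, \alpha) &= \int \nabla_{\theta,\alpha}\left(P_\theta(H \mid x)\right) \Delta(F_\alpha(x, H), y)\, P(x, y)\, dH\, dx\, dy \\
&\quad + \int P_\theta(H \mid x)\, \nabla_{\theta,\alpha} \Delta(F_\alpha(x, H), y)\, P(x, y)\, dH\, dx\, dy \\
&= \int P_\theta(H \mid x)\, \nabla_{\theta,\alpha}\left(\log P_\theta(H \mid x)\right) \Delta(F_\alpha(x, H), y)\, P(x, y)\, dH\, dx\, dy \\
&\quad + \int P_\theta(H \mid x)\, \nabla_{\theta,\alpha} \Delta(F_\alpha(x, H), y)\, P(x, y)\, dH\, dx\, dy
\end{aligned} \tag{5}$$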
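
The passage from the first to the second equality relies on the standard log-derivative identity, combined with the product form of the trajectory probability in Eq. (2). Spelled out as a worked step (added here for clarity, using the notation above):

```latex
\nabla_{\theta} P_\theta(H \mid x)
  = P_\theta(H \mid x)\,\frac{\nabla_{\theta} P_\theta(H \mid x)}{P_\theta(H \mid x)}
  = P_\theta(H \mid x)\,\nabla_{\theta} \log P_\theta(H \mid x),
\qquad
\log P_\theta(H \mid x) = \sum_{i=1}^{t-1} \log \pi_{\theta_{(i)}}(x, n_{(i+1)}).
```

Since $P_\theta(H \mid x)$ does not depend on $\alpha$, and the loss at a fixed leaf does not depend on $\theta$, the first integral only contributes to the $\theta$ gradient and the second only to the $\alpha$ gradient.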

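Using the Monte Carlo approximation of this expectation by sampling M trajectories
for each of the N training examples, and given that $\Delta(F_\alpha(x^i, H), y^i) = \Delta(\alpha_{(t)}, y^i)$, we obtain:

$$\nabla_{\theta,\alpha} J(\theta, \alpha) \approx \frac{1}{N} \frac{1}{M} \sum_{i=1}^{N} \sum_{m=1}^{M} \left[ \sum_{j=1}^{t-1} \nabla_{\theta,\alpha}\left(\log \pi_{\theta_{(j)}}(x^i)\right) \Delta(\alpha_{(t)}, y^i) + \nabla_{\theta,\alpha} \Delta(\alpha_{(t)}, y^i) \right] \tag{6}$$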


Intuitively, the gradient is composed of two terms:

• The first term aims at penalizing trajectories with high loss, and thus encourages the model
to find trajectories with low loss. The first term only acts on the $\theta$ parameters, modifying
the probabilities computed at the internal nodes.

• The second term is the gradient computed over the final loss and concerns the $\alpha$ values
corresponding to the leaf node where the input x arrives at the end of the process.
It thus changes the $\alpha$ used for prediction in order to capture the category of x. This
gradient term is responsible for allocating the categories to the leaves of the
tree.

2. Note that our approach is not restricted to classification and can be used for regression for example, see
Section 2.4.

The learning algorithm in its stochastic gradient variant is described in Algorithm 1.
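
The loop body of Algorithm 1 can be sketched in plain Python as follows. Several choices here are assumptions made for illustration only: a softmax over linear scores as the policy, the square loss as $\Delta$, a nested-dictionary tree, and a loss evaluated before the leaf update; the names `make_tree` and `sgd_step` are hypothetical.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_tree(width, depth, dim, num_classes, rng):
    """Complete tree as nested dicts; leaves hold alpha, internal nodes hold theta."""
    if depth == 0:
        return {"alpha": [rng.uniform(-0.1, 0.1) for _ in range(num_classes)]}
    return {
        "theta": [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(width)],
        "children": [make_tree(width, depth - 1, dim, num_classes, rng)
                     for _ in range(width)],
    }


def sgd_step(root, x, y, lr, rng):
    """One iteration of the loop body of Algorithm 1 on the example (x, y)."""
    # Sample H with the current policies, remembering each decision.
    node, decisions = root, []
    while "children" in node:
        probs = softmax([sum(w * xi for w, xi in zip(row, x)) for row in node["theta"]])
        k = rng.choices(range(len(probs)), weights=probs)[0]
        decisions.append((node, probs, k))
        node = node["children"][k]
    alpha = node["alpha"]
    # Square loss at the reached leaf, evaluated before the leaf update (assumption).
    loss = sum((a - t) ** 2 for a, t in zip(alpha, y))
    # Leaf update: alpha <- alpha - lr * grad_alpha(loss).
    for c in range(len(alpha)):
        alpha[c] -= lr * 2.0 * (alpha[c] - y[c])
    # Policy update: theta <- theta - lr * grad_theta(log pi) * loss.
    # For a linear softmax, d log pi_k / d theta_j = (1[j == k] - p_j) * x.
    for visited, probs, k in decisions:
        for j, row in enumerate(visited["theta"]):
            coeff = (1.0 if j == k else 0.0) - probs[j]
            for d in range(len(row)):
                row[d] -= lr * coeff * x[d] * loss
    return loss


# Tiny hypothetical problem: two classes, the second input feature acts as a bias.
rng = random.Random(0)
tree = make_tree(width=2, depth=2, dim=2, num_classes=2, rng=rng)
data = [([1.0, 1.0], [1.0, -1.0]), ([-1.0, 1.0], [-1.0, 1.0])]
for _ in range(2000):
    x, y = rng.choice(data)
    sgd_step(tree, x, y, lr=0.05, rng=rng)
```

A trajectory that ends with a high loss lowers the probability of the decisions that produced it, while the reached leaf moves its $\alpha$ toward the label; both updates come from the same sampled path.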

2.4 Discussion 

Inference Complexity: The complexity of the inference process is linear in the depth
of the tree. Typically, in a multi-class classification problem, the depth of the tree will be
proportional to $\log C$, resulting in a very fast inference process similar to the one
obtained, for example, using Hierarchical SoftMax modules.
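
As a small arithmetic illustration of the logarithmic depth, the helper below (a hypothetical name, not from the paper) computes the smallest depth at which a complete tree of a given width has at least C leaves, which is the number of node decisions taken per prediction.

```python
def tree_depth(num_classes, width):
    """Smallest depth D such that width**D >= num_classes (integer arithmetic)."""
    depth, leaves = 0, 1
    while leaves < num_classes:
        leaves *= width
        depth += 1
    return depth


for c in (16, 32, 10_000):
    print(c, tree_depth(c, 2), tree_depth(c, 3))
# 16 4 3
# 32 5 4
# 10000 14 9
```

By comparison, a one-versus-all model evaluates one classifier per category, i.e. C of them.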

Learning Complexity: The policy gradient algorithm developed in this paper is an
iterative gradient-based method. The complexity of each learning iteration is $O(N \log C)$, but the
number of iterations needed is not known. Moreover, the optimization problem is clearly
not convex, and the system can get stuck in a local minimum. As explained in
Section 3, a way to avoid problematic local minima is to choose a number of leaves which
is higher than the number of categories, giving more degrees of freedom to the model.

Using RDT for complex problems: Different function topologies can be used for $\pi$.
In the following we have used simple linear functions, but more sophisticated ones, like
neural networks, can be tested. Moreover, in our model, there are no constraints on the $\alpha$
parameters, nor on the loss function $\Delta$, which only has to be differentiable. It thus means
that our model can also be used for other tasks like multivariate regression, or even for
producing continuous outputs at a low price. In that case, RDTs act as discretization
processes where the objective of the task and the discretization are learned simultaneously.

3. Preliminary Experiments 

In this section, we provide a set of experiments that have been made on toy datasets
to better understand the ability of RDT to perform good predictions and to discover a
relevant hierarchy³. Our model has been compared to the same model but where the
categories associated with the leaves (the $\alpha_i$ values) have been chosen randomly, each leaf
being associated with one possible category, i.e. each vector $\alpha_i$ is full of −1 with one 1-value
at a random position; this model is called Random Tree. In that case, the $\alpha$-parameters are
not updated during the gradient descent. Hyper-parameters (learning rate and number of
iterations) have been tuned by cross-validation, and the results have been averaged over five
different runs. We also made comparisons with a linear SVM (one versus all) and decision
trees⁴. These preliminary experiments aim at showing the ability of our integrated approach
to learn how to associate categories with the leaves of the tree.
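
The leaf initialisation of the Random Tree baseline, as described above, can be sketched as follows; the function name `random_leaf_alphas` is hypothetical. In the baseline these vectors stay fixed, and only the policy parameters are trained.

```python
import random


def random_leaf_alphas(num_leaves, num_classes, rng):
    """One alpha per leaf: all -1 except a single +1 at a random class position."""
    alphas = []
    for _ in range(num_leaves):
        alpha = [-1.0] * num_classes
        alpha[rng.randrange(num_classes)] = 1.0
        alphas.append(alpha)
    return alphas


baseline_alphas = random_leaf_alphas(num_leaves=16, num_classes=16, rng=random.Random(0))
```

Because positions are drawn independently, some categories may be assigned to several leaves and others to none.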

We consider simple 2D datasets composed of categories sampled following a Gaussian 
distribution. Each category is composed of 100 vectors (50 for train, 50 for test) sampled 

3. Experiments on real-world datasets are currently under way.

4. Using the implementation of sklearn.

                 Accuracy ± Variance
W   D   L    RDT            Random Trees
2   3   8    0.46 ± 0.01    0.18 ± 0.01
2   4   16   0.70 ± 0.06    0.25 ± 0.02
2   5   32   0.83 ± 0.04    0.26 ± 0.06
3   2   9    0.51 ± 0.01    0.25 ± 0.01
3   3   27   0.75 ± 0.04    0.24 ± 0.02

Acc. of linear SVM: 0.50 ± 0.01
DT (depth=5):       0.46 ± 0.04
DT (depth=10):      0.80 ± 0.01
DT (depth=50):      0.79 ± 0.01

Figure 2: Performance (left) on 16 categories and (right) corresponding decision frontiers
of RDT: W is the width of the tree (i.e. the number of children per node), D is
the depth and L is the resulting number of leaves.


following $\mathcal{N}(\mu_c, \sigma_c)$, where $\mu_c$ has been uniformly sampled between (−1, −1) and (1, 1)
and $\sigma_c$ has also been sampled; see Figure 2 (right) for the dataset with C = 16 categories
and Figure 3 (right) for the dataset with C = 32 categories. Two preliminary conclusions can be
drawn from the presented results: (i) when considering a particular
architecture, the RDT is able to determine how to allocate the categories in the leaves of
the tree: the performance of RDT is clearly better than that of random trees, where categories
have been randomly assigned to the leaves; (ii) when considering
a problem with C categories, a good option is to build a tree with L > C leaves since it
gives more degrees of freedom to the model, which will more easily find how to allocate the
categories in the leaves. Note that, for the problem with 32 categories, an 84% accuracy is
obtained when using a tree of depth 6: only 6 binary classifiers are used for predicting the
category. Our approach performs better than a linear SVM, whose inference complexity
($O(C)$) is higher than ours. This is mainly due to the ability of our model
to learn non-linear decision frontiers. Finally, when considering decision trees, one can see
that they are sometimes equivalent to RDT. We think that this is mainly due to the small
dimension of the input space and the small number of examples, a setting for which decision
trees are well adapted.
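
A toy dataset of the kind described above can be generated along the following lines. The paper does not state how $\sigma_c$ was sampled or whether the Gaussians are isotropic, so the `sigma_range` default and the isotropic choice are assumptions made for this sketch, as is the function name `make_toy_dataset`.

```python
import random


def make_toy_dataset(num_classes, per_class=100, sigma_range=(0.05, 0.2), seed=0):
    """2-D Gaussian blobs, one per category, split half train and half test."""
    rng = random.Random(seed)
    train, test = [], []
    for c in range(num_classes):
        # Centre sampled uniformly between (-1, -1) and (1, 1), as in the text.
        mu = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
        # Assumed: one isotropic standard deviation per category.
        sigma = rng.uniform(*sigma_range)
        points = [((rng.gauss(mu[0], sigma), rng.gauss(mu[1], sigma)), c)
                  for _ in range(per_class)]
        half = per_class // 2
        train.extend(points[:half])
        test.extend(points[half:])
    return train, test


train_set, test_set = make_toy_dataset(16)
print(len(train_set), len(test_set))  # 800 800
```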


4. Related Work 

In multi-class classification problems, the classical approach is to train one-versus-all
classifiers. It is one of the most efficient techniques (Jr and Freitas; Babbar and Partalas, 2013)
even with a very large number of classes, but the inference complexity is linear w.r.t. the
number of possible categories, resulting in slow prediction algorithms.

Hierarchical models have been proposed to reduce this complexity. They have been
developed for two cases: (i) a first one where a hierarchy of categories is already known; in
that case, the hierarchy of classifiers is mapped on the hierarchy of classes; (ii) a second

                 Accuracy ± Variance
W   D   L    RDT            Random Trees
2   5   32   0.71 ± 0.04    0.10 ± 0.02
2   6   64   0.84 ± 0.01    0.11 ± 0.03
3   3   27   0.58 ± 0.04    0.12 ± 0.04
3   4   81   0.79 ± 0.04    0.14 ± 0.06

Acc. of linear SVM: 0.54 ± 0.01
DT (depth=5):       0.77 ± 0.03
DT (depth=10):      0.88 ± 0.02
DT (depth=50):      0.86 ± 0.01

Figure 3: Performance (left) on 32 categories and (right) corresponding decision frontiers
of RDT: W is the width of the tree (i.e. the number of children per node), D is
the depth and L is the resulting number of leaves.


approach, closer to ours, consists in automatically building a hierarchy from the training set.
This is usually done in a preliminary step by using, for example, clustering techniques like
spectral clustering on the confusion matrix (Bengio et al.), probabilistic label trees (Liu
et al., 2013) or even partitioning optimization (Weston et al.). Compared to these approaches,
RDT has the advantage of learning the hierarchy and the classifier in an integrated step
guided only by a unique loss function. The closest work is perhaps Choromanska and Langford
(2014), which discovers the hierarchy using online learning algorithms, the construction of the
tree being made during learning. Other families of methods have been proposed, like
error-correcting codes (Dietterich and Bakiri; Schapire; Cisse et al.), sparse coding (Zhao and
Xing, 2013) or even representation learning techniques, representations of categories
being obtained by unsupervised models (Weinberger and Chapelle; Bengio et al.).

Finally, the use of sequential learning models, inspired by reinforcement learning, in the
context of classification or regression has been explored recently for different applications
like feature selection (Dulac-Arnold et al.) or image classification (Dulac-Arnold et al.,
2014; Gregor et al., 2015). Our model belongs to this family of approaches.


5. Conclusion and Perspectives 

We have presented Reinforced Decision Trees, a learning model able to simultaneously
learn how to allocate categories in a hierarchy and how to classify inputs. RDTs are
sequential decision models where the prediction over one input is made using $O(\log C)$
classifiers, making this method suitable for problems with a large number of categories. Moreover,
the method can be easily adapted to any learning problem like regression or ranking by
changing the loss function. RDTs are learned by using a policy gradient-inspired method.
Preliminary results show the effectiveness of this approach. Future work mainly involves
real-world experimentation, but also extension of this model to continuous output problems.